Abstract

This paper presents an approach for establishing 
correspondences in time and in space between two dif- 
ferent video sequences of the same dynamic scene, 
recorded by stationary uncalibrated video cameras. 
The method simultaneously estimates both spatial 
alignment as well as temporal synchronization (tem- 
poral alignment) between the two sequences, using 
all available spatio-temporal information. Temporal 
variations between image frames (such as moving ob- 
jects or changes in scene illumination) are powerful 
cues for alignment, which cannot be exploited by stan- 
dard image-to-image alignment algorithms. We show 
that by folding spatial and temporal cues into a single 
alignment framework, situations which are inherently
ambiguous for traditional image-to-image alignment
methods are often uniquely resolved by
sequence-to-sequence alignment.

We also present a "direct" method for sequence- 
to-sequence alignment. The algorithm simultaneously 
estimates spatial and temporal alignment parameters 
directly from measurable sequence quantities, without 
requiring prior estimation of point correspondences, 
frame correspondences, or moving object detection. 
Results are shown on real image sequences taken by 
multiple video cameras. 

1 Introduction 

The problem of image-to-image alignment has been 
extensively studied in the literature. By "image-to- 
image alignment" we refer to the problem of densely 
estimating point correspondences between two or more 
images (either taken by a single moving camera, or by 
multiple cameras), i.e., for each pixel (x, y) in one
image, find its corresponding pixel in the other image:
(x', y') = (x + u, y + v), where (u, v) is the spatial
displacement. This paper addresses a different problem:
the problem of "sequence-to-sequence alignment",
which establishes correspondences both in time and in
space between multiple sequences (as opposed to
multiple images). Namely, for each pixel (x, y) in each
frame (time) t in one sequence, find its corresponding
frame t' and pixel (x', y') in the other sequence:
(x', y', t') = (x + u, y + v, t + w), where (u, v, w) is the
spatio-temporal displacement.

The need for sequence-to-sequence alignment ex- 
ists in many real-world scenarios, where multiple video 
cameras record information about the same scene over 
a period of time. Some examples are: News items 
commonly documented by several media crews; sports 
events covered by at least a dozen cameras recording 
the same scene from different view points; wide-area
surveillance of the same scene by multiple cameras
from different observation points. Grimson et al. [7]
suggested a few applications of multiple collaborating
sensors. Reid and Zisserman [5] combined information
from two independent sequences taken at the
1966 World Cup, to resolve the controversy regarding
the famous goal. They manually synchronized the
sequences, and then computed spatial alignment be- 
tween selected corresponding images (i.e., image-to- 
image alignment). This is an example where spatio- 
temporal sequence-to-sequence alignment was needed. 

Image-to-image alignment methods are inherently 
restricted to the information contained in individual 
images - the spatial variations within an image (which 
corresponds to scene appearance). However, a video 
sequence contains much more information than any
individual frame does. Scene dynamics (such as moving
objects, changes in illumination, etc.) are a property that
is inherent to the scene and common to all sequences 
taken from different video cameras. It therefore forms 
an additional powerful cue for alignment. 

Stein [6] proposed an elegant approach to estimat- 
ing spatio-temporal correspondences between two se- 
quences based on alignment of trajectories of mov- 
ing objects. Centroids of moving objects were de- 
tected and tracked in each sequence. Spatio-temporal 
alignment parameters were then sought that would
bring the trajectories in the two sequences into
alignment. No static-background information was used in
this step¹. This approach is hence referred to in our

¹ In a later step, [6] refines the spatial alignment using static






paper as "trajectory-to-trajectory alignment". Giese
and Poggio [3] also used trajectory-to-trajectory align- 
ment to classify human motion patterns. Both [6, 3] 
reported that using temporal information (i.e., the tra- 
jectories) alone for alignment across the sequences may 
not suffice, and can often lead to inherent ambiguities 
between temporal and spatial alignment parameters. 

This paper proposes an approach to sequence-to- 
sequence alignment, which simultaneously uses all 
available spatial and temporal information within a 
sequence. We show that when there is no temporal 
information present in the sequence, our approach
reduces to image-to-image alignment. However, when
such information exists, it takes advantage of it.
Similarly, we show that when no static spatial
information is present, our approach reduces to
trajectory-to-trajectory alignment. Here too, when such
information is available, it takes advantage of it. Thus
our approach to sequence-to-sequence alignment
combines the benefits of image-to-image alignment with
the benefits of trajectory-to-trajectory alignment, and 
is a generalization of both approaches. We show that 
it resolves many of the inherent ambiguities associated 
with each of these two classes of methods. 

We also present a specific algorithm for sequence- 
to-sequence alignment, which is a generalization of the 
direct image alignment method of [1]. It is assumed 
that the sequences are taken by stationary video cam- 
eras, with fixed (but unknown) internal and external 
parameters. Our algorithm simultaneously estimates 
spatial and temporal alignment parameters without 
requiring prior estimation of point correspondences, 
frame correspondences, moving object detection, or 
detection of illumination variations. 

The remainder of this paper is organized as follows: 
Section 2 presents our direct method for the spatio- 
temporal sequence-to-sequence alignment algorithm. 
Section 3 studies some inherent properties of sequence- 
to-sequence alignment, and compares it against image- 
to-image alignment and trajectory-to-trajectory align- 
ment. Section 4 provides selected experimental results 
on real image sequences taken by multiple unsynchro- 
nized and uncalibrated video cameras. 

2 The Sequence Alignment Algorithm 

The scenario addressed in this paper is when 
the video cameras are stationary, with fixed (but 
unknown) internal and external parameters. The 
recorded scene can change dynamically, i.e., it can in- 
clude multiple independently moving objects (there is 

background information. However, the temporal alignment is
already fixed at that point.




Figure 1: The hierarchical spatio-temporal
alignment framework. A volumetric pyramid is
constructed for each input sequence, one for the reference
sequence (on the right side), and one for the second sequence
(on the left side). The spatio-temporal alignment
estimator is applied iteratively at each level. It refines the
approximation based on the residual misalignment between
the reference volume and the warped version of the second
volume (drawn as a skewed cube). The output of the current level
is propagated to the next level to be used as an initial
estimate.

no limitation on the number of moving objects or their 
motions), it can include changes in illumination over 
time, and/or any other temporal change. Temporal 
misalignment can result from the fact that the two 
input sequences can be at different frame rates (e.g., 
PAL and NTSC), or may have a time-shift (offset) 
between them (e.g., if the cameras were not activated 
simultaneously). These factors give rise to a 1-D affine 
transformation in time. Spatial misalignment results 
from the fact that the two cameras are in different 
positions and have different internal calibration pa- 
rameters. The spatial alignment can range from 2D 
parametric transformations to more general 3D trans- 
formations. 
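
As a small illustration of the 1-D affine temporal model, the sketch below maps a frame index t of one sequence to the corresponding (generally fractional) frame index t' = t + w(t) of the other, with w(t) = d1*t + d2. The function names, the nominal frame rates (25 fps for PAL, 30000/1001 fps for NTSC) and the 10-frame offset are assumptions chosen for illustration; they are not taken from the experiments reported in this paper.

```python
from fractions import Fraction

PAL_FPS = Fraction(25)              # nominal PAL frame rate (assumed)
NTSC_FPS = Fraction(30000, 1001)    # nominal NTSC frame rate (assumed)

def temporal_affine_params(fps_ref, fps_second, offset_frames):
    """Return (d1, d2) such that t' = t + w(t), with w(t) = d1 * t + d2.

    offset_frames is the frame index in the second sequence that
    corresponds to frame 0 of the reference sequence.
    """
    d1 = Fraction(fps_second) / Fraction(fps_ref) - 1
    d2 = Fraction(offset_frames)
    return d1, d2

def corresponding_frame(t, d1, d2):
    """Map reference frame index t to the (possibly sub-frame) index t'."""
    return t + d1 * t + d2

d1, d2 = temporal_affine_params(PAL_FPS, NTSC_FPS, offset_frames=10)
t_prime = corresponding_frame(100, d1, d2)
print(float(t_prime))   # a fractional frame index in the second sequence
```

Exact rational arithmetic is used here only to keep the example free of rounding; in the estimation itself d1 and d2 are real-valued unknowns.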

This section presents an algorithm for
sequence-to-sequence alignment. The algorithm is a
generalization of the hierarchical direct image-to-image
alignment method of Bergen et al. [1] and Irani et al. [4].
In [1, 4] the spatial alignment parameters were recovered
directly from image brightness variations, and the
coarse-to-fine estimation was done using a Gaussian
image pyramid. This is generalized here, to recover 
the spatial and temporal alignment parameters directly 
from sequence brightness variations, and the coarse- 
to-fine estimation is done within a volumetric sequence 
pyramid. An image sequence is handled as a volume 
of three dimensional data, and not as a set of two- 



dimensional images. Pixels become spatio-temporal
"voxels" with three coordinates: (x, y, t), where x, y
denote spatial image coordinates, and t denotes time.
The multi-scale analysis is done both in space and in 
time. 

Fig 1 illustrates the hierarchical spatio-temporal es- 
timation framework. The rest of this section is orga- 
nized as follows: Section 2.1 describes the core step 
(the inner-loop) within the iterate-refine algorithm. 
In particular, it generalizes the image brightness con- 
straint to handle sequences. Section 2.2 presents a few 
sequence-to-sequence alignment models which were 
implemented in the current algorithm. Section 2.3 
presents the volumetric sequence pyramid. Section 2.4
summarizes the algorithm. 

2.1 The Sequence Brightness Error 

Let S, S' be two input image sequences, where S
denotes the reference sequence and S' denotes the second
sequence. Let (x, y, t) be a spatio-temporal "voxel" in
the reference sequence S. Let u, v be its local spatial
displacements, and w be its temporal displacement. It
is assumed that the observed brightness of any scene
point (static or dynamic) is similar across cameras
at corresponding time instances², but may vary over
time, due to independent motion of objects or due to
changes in illumination. This is formulated as follows:

$$S'(x, y, t) \approx S(x - u,\ y - v,\ t - w) \qquad (1)$$

Expanding the right hand side to its first order Taylor 
series around (x, y,t) yields: 

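$$S'(x, y, t) \approx S(x, y, t) - [u\ \ v\ \ w]\, \nabla S(x, y, t) \qquad (2)$$

where $\nabla S = [S_x\ \ S_y\ \ S_t]^T = \left[\frac{\partial S}{\partial x}\ \ \frac{\partial S}{\partial y}\ \ \frac{\partial S}{\partial t}\right]^T$ denotes a spatio-temporal
gradient. Eq. (2) directly relates the unknown
displacements (u, v, w) to measurable brightness variations
within the sequence.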
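
To make the quantities in Eq. (2) concrete, the following sketch computes a finite-difference spatio-temporal gradient of a volume and checks the first-order relation on a synthetic brightness ramp, for which the Taylor expansion is exact. The array layout `S[t, y, x]`, the function name and the use of `numpy.gradient` are assumptions made for this illustration; the paper does not specify how the derivatives are computed.

```python
import numpy as np

def spatio_temporal_gradient(S):
    """Return (S_x, S_y, S_t) for a volume indexed as S[t, y, x]."""
    S = np.asarray(S, dtype=float)
    S_t, S_y, S_x = np.gradient(S)    # derivatives along axes (t, y, x)
    return S_x, S_y, S_t

# Synthetic brightness ramp S = 2x + 3y + 5t, whose gradient is constant.
t, y, x = np.meshgrid(np.arange(6), np.arange(7), np.arange(8), indexing="ij")
S = 2.0 * x + 3.0 * y + 5.0 * t
S_x, S_y, S_t = spatio_temporal_gradient(S)

# Right-hand sides of Eq. (1) and Eq. (2) for a small displacement (u, v, w).
u, v, w = 0.5, -0.25, 0.1
S_shifted = 2.0 * (x - u) + 3.0 * (y - v) + 5.0 * (t - w)   # S(x-u, y-v, t-w)
S_taylor = S - (u * S_x + v * S_y + w * S_t)                 # first-order form
print(np.allclose(S_shifted, S_taylor))
```

For real sequences the two sides agree only approximately, and only for small displacements, which is why the estimation is embedded in an iterative coarse-to-fine scheme.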

Denote by $P = (P_{spatial}, P_{temporal})$ the unknown
alignment parameter vector. While every "voxel"
(x, y, t) has a different local spatio-temporal
displacement (u, v, w), they are all globally constrained by the
parametric model P. Therefore, every "voxel" (x, y, t)
provides one constraint of Eq. (2) on the global
parameters. A global constraint on P is obtained by

² In practice, brightness may differ across cameras. We can
handle this by using the Laplacian pyramid (as opposed to the
Gaussian pyramid), or otherwise by pre-filtering the sequences
(e.g., normalizing to remove global mean and contrast changes
between the sequences), and applying the brightness constraint
to the filtered sequences.



minimizing the following SSD objective function, using
least squares minimization:

$$ERR(P) = \sum_{x,y,t} \left(e(x, y, t; P)\right)^2, \qquad (3)$$

where

$$e(x, y, t; P) = S'(x, y, t) - S(x, y, t) + [u\ \ v\ \ w]\, \nabla S(x, y, t), \qquad (4)$$

and u = u(x, y, t; P), v = v(x, y, t; P), w = w(x, y, t; P).
To allow for large spatio-temporal displacements (u, v, w),
the minimization of Eq. (3) is done within an
iterative-warp coarse-to-fine framework (see Sections 2.3 and 2.4).
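
The least-squares minimization of Eq. (3) can be sketched for the simplest possible model, a single constant displacement (u, v, w) shared by all voxels. This is a deliberately reduced stand-in for the parametric models of Section 2.2, not the authors' implementation; the function name, the `[t, y, x]` array layout and the synthetic test scene are assumptions.

```python
import numpy as np

def estimate_global_shift(S_ref, S_sec):
    """Least-squares (u, v, w) minimising Eq. (3) for a constant displacement.

    Both volumes are indexed [t, y, x].
    """
    S = np.asarray(S_ref, dtype=float)
    S2 = np.asarray(S_sec, dtype=float)
    S_t, S_y, S_x = np.gradient(S)                  # finite-difference gradient
    # One row [S_x, S_y, S_t] per voxel, so that e = r + G @ [u, v, w].
    G = np.stack([S_x.ravel(), S_y.ravel(), S_t.ravel()], axis=1)
    r = (S2 - S).ravel()                            # S' - S
    A = G.T @ G                                     # 3 x 3 normal matrix
    b = -G.T @ r
    return np.linalg.solve(A, b)

def scene(x, y, t):
    """Smooth synthetic scene with spatial and temporal variation (assumed)."""
    return np.sin(0.3 * x) + np.cos(0.2 * y) + np.sin(0.25 * t)

t, y, x = np.meshgrid(np.arange(16.0), np.arange(24.0), np.arange(32.0),
                      indexing="ij")
shift_true = np.array([0.2, -0.15, 0.1])            # (u, v, w), sub-voxel
S_ref = scene(x, y, t)
S_sec = scene(x - 0.2, y + 0.15, t - 0.1)           # S'(x,y,t) = S(x-u, y-v, t-w)
shift_est = estimate_global_shift(S_ref, S_sec)
print(np.round(shift_est, 3))
```

Because every voxel contributes one row to `G`, voxels with a zero derivative along some axis simply add nothing to the corresponding entries of the normal matrix.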

Note that the objective function in Eq. (3) integrates
all available spatio-temporal information in the
sequence. Each spatio-temporal "voxel" contributes
as much information as it reliably can to each
unknown. For example, a "voxel" which lies on a
stationary vertical edge (i.e., $S_x \neq 0$, $S_y = S_t = 0$)
affects only the estimation of the parameters involved
in the horizontal displacement u(x, y, t; P). Similarly,
a "voxel" in a uniform region ($S_x = S_y = 0$) which
undergoes a temporal change ($S_t \neq 0$), e.g., due to
variation in illumination, contributes only to the estimation
of the parameters affecting the temporal displacement
w(x, y, t; P). A highly textured "voxel" on a moving
object (i.e., $S_x \neq 0$, $S_y \neq 0$, $S_t \neq 0$) contributes to
the estimation of all parameters.

2.2 Spatio-Temporal Alignment Models

In our current implementation, $P = (P_{spatial}, P_{temporal})$
was chosen to be a parametric transformation.
Let $p = (x, y, 1)^T$ denote the
homogeneous spatial coordinates of a spatio-temporal
"voxel" (x, y, t). Let H be the 3×3 matrix of
the spatial parametric transformation between
the two sequences. Denoting the rows of H by
$[H_1, H_2, H_3]^T$, the spatial displacement can be written
as: $u(x, y, t) = \frac{H_1 p}{H_3 p} - x$, and $v(x, y, t) = \frac{H_2 p}{H_3 p} - y$.
Note that H is common to all frames, because the
cameras are stationary. When the two cameras have
different frame rates (such as with NTSC and PAL)
and possibly a time shift, a 1-D affine transformation
suffices to model the temporal misalignment between
the two sequences: $w(t) = d_1 t + d_2$ (where $d_1$ and $d_2$
are real numbers). We have currently implemented
two different spatio-temporal parametric alignment
models:

Model 1: 2D spatial affine transformation & 1D
temporal affine transformation. The spatial 2D affine
model is obtained by setting the third row of H to be:
$H_3 = [0\ \ 0\ \ 1]$. Therefore, for 2D spatial affine and 1D
temporal affine transformations, the unknown parameters
are: $P = [h_{11}\ h_{12}\ h_{13}\ h_{21}\ h_{22}\ h_{23}\ d_1\ d_2]^T$,
i.e., eight unknowns. The individual voxel
error of Eq. (4) becomes:

$$e(x, y, t; P) = S' - S + \left[(H_1 p - x)\ \ (H_2 p - y)\ \ (d_1 t + d_2)\right] \nabla S,$$

which is linear in all unknown parameters.

Model 2: 2D spatial projective transformation & a
temporal offset. In this case $w(t) = d$ (d can
be a sub-frame shift), and
$P = [h_{11}\ h_{12}\ h_{13}\ h_{21}\ h_{22}\ h_{23}\ h_{31}\ h_{32}\ h_{33}\ d]^T$.
Each spatio-temporal "voxel" (x, y, t) provides one constraint:

$$e(x, y, t; P) = S' - S + \left[\left(\frac{H_1 p}{H_3 p} - x\right)\ \ \left(\frac{H_2 p}{H_3 p} - y\right)\ \ d\right] \nabla S. \qquad (5)$$

The 2D projective transformation is not linear in the
unknown parameters, and hence requires some additional
manipulation. To overcome this non-linearity,
Eq. (5) is multiplied by the denominator ($H_3 p$), and
normalized by its value from the last iteration, leading to
a slightly different error term:

$$e_{new}(x, y, t; P) = \frac{H_3 p}{\hat{H}_3 p}\, e_{old}(x, y, t; P), \qquad (6)$$

where $\hat{H}_3$ is the current estimate of $H_3$ in the iterative
process, and $e_{old}$ is the error term of Eq. (5). Let $\hat{H}$ and
$\hat{d}$ be the current estimates of H and d, respectively.
Substituting $H = \hat{H} + \delta H$ and $d = \hat{d} + \delta d$ into
Eq. (6), and neglecting high-order terms, leads to a new
error term which is linear in all unknown parameters
($\delta H$ and $\delta d$). In addition to second-order terms
(e.g., $\delta H\, \delta d$), the first-order term $\hat{d}\, \delta H_3$ is also
negligible and can be ignored.

In the above implementations P was [...]

2.3 Spatio-Temporal Volumetric Pyramid

The estimation step described in Section 2.1 is
embedded in an iterative-warp coarse-to-fine estimation
framework, implemented within a spatio-temporal
volumetric pyramid. Multi-scale analysis
provides three main benefits: (i) larger misalignments
can be handled, (ii) the convergence rate is faster, and
(iii) it avoids getting trapped in local minima. These
three benefits are discussed in [1] for the case of spatial
alignment. Here they are extended to the temporal
domain. [...] Gaussian³ sequence (volumetric) pyramid.
The highest resolution level is defined as the input
sequence. Consecutive lower resolution levels are
obtained by low-pass filtering (LPF) both in space and
in time [...]. A discussion of the tradeoffs between
spatial and temporal low-pass filtering may
be found in Appendix A.

2.4 Summary of the Algorithm

The iterative-warp coarse-to-fine estimation process
is schematically described in Fig 1, and is
summarized below:

1. Construct two spatio-temporal volumetric
pyramids, one for each input sequence:
$(S_0 := S), S_1, S_2, \ldots, S_L$ and $(S'_0 := S'), S'_1, S'_2, \ldots, S'_L$.
Set P to its initial value (usually the identity
transformation).

2. For every resolution level, $l = L..0$, do:

(a) Warp $S'_l$ using the current parameter
estimate: $\hat{S}'_l := warp(S'_l; P)$.

(b) Refine P according to the residual
misalignment between the reference $S_l$ and the
warped $\hat{S}'_l$ (see Section 2.1).

(c) Repeat steps (a) and (b) until $\|\Delta P\| < \epsilon$.

3. Propagate P to the next pyramid level $l - 1$, and
repeat the steps (a), (b), (c) for $S_{l-1}$ and $S'_{l-1}$.

The resulting P is the spatio-temporal transformation,
and the resulting alignment is at sub-pixel spatial
accuracy, and sub-frame temporal accuracy. Results of
applying this algorithm to real image sequences are
shown in Section 4.
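
The iterative-warp coarse-to-fine loop summarized above can be sketched as follows. This is only an illustrative reduction, under several assumptions that are not taken from the paper: the alignment model is a single constant displacement (u, v, w) rather than the affine or projective models of Section 2.2; the pyramid uses a [1, 2, 1]/4 low-pass kernel and subsampling by a factor of 2 along x, y and t; warping uses linear interpolation with border clamping; and all function names are hypothetical.

```python
import numpy as np

def shift_axis(vol, delta, axis):
    """Sample vol at (index + delta) along one axis with linear interpolation."""
    n = vol.shape[axis]
    pos = np.clip(np.arange(n) + delta, 0, n - 1)    # clamp at the borders
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, n - 1)
    shape = [1] * vol.ndim
    shape[axis] = n
    frac = (pos - lo).reshape(shape)
    return (1 - frac) * np.take(vol, lo, axis=axis) + frac * np.take(vol, hi, axis=axis)

def warp(vol, p):
    """Resample vol[t, y, x] at (x + u, y + v, t + w) for p = (u, v, w)."""
    u, v, w = p
    return shift_axis(shift_axis(shift_axis(vol, u, 2), v, 1), w, 0)

def reduce_level(vol):
    """Low-pass ([1, 2, 1] / 4) and subsample by 2 along t, y and x."""
    for axis in range(3):
        n = vol.shape[axis]
        pad = [(1, 1) if a == axis else (0, 0) for a in range(3)]
        padded = np.pad(vol, pad, mode="edge")
        taps = [np.take(padded, np.arange(k, k + n), axis=axis) for k in range(3)]
        vol = 0.25 * taps[0] + 0.5 * taps[1] + 0.25 * taps[2]
        vol = np.take(vol, np.arange(0, n, 2), axis=axis)
    return vol

def refine(S, S_warped):
    """One least-squares step of Eq. (3) for a constant (u, v, w)."""
    S_t, S_y, S_x = np.gradient(S)
    # Ignore a border margin, where the warped volume was clamped.
    inner = tuple(slice(max(1, n // 8), n - max(1, n // 8)) for n in S.shape)
    G = np.stack([g[inner].ravel() for g in (S_x, S_y, S_t)], axis=1)
    r = (S_warped - S)[inner].ravel()
    dp, *_ = np.linalg.lstsq(G, -r, rcond=None)
    return dp

def align(S, S2, levels=3, max_iter=20, eps=1e-3):
    """Coarse-to-fine estimate of p = (u, v, w) with S2(x) ~ S(x - p)."""
    pyr, pyr2 = [np.asarray(S, float)], [np.asarray(S2, float)]
    for _ in range(levels - 1):                  # step 1: build both pyramids
        pyr.append(reduce_level(pyr[-1]))
        pyr2.append(reduce_level(pyr2[-1]))
    p = np.zeros(3)                              # start from the identity
    for level in range(levels - 1, -1, -1):      # step 2: coarsest to finest
        for _ in range(max_iter):
            dp = refine(pyr[level], warp(pyr2[level], p))   # (a) warp, (b) refine
            p = p + dp
            if np.linalg.norm(dp) < eps:         # (c) stop when the update is small
                break
        if level > 0:
            p = 2.0 * p                          # step 3: rescale for the finer level
    return p

# Synthetic check: a smooth scene and a copy displaced by a known (u, v, w).
def scene(x, y, t):
    return (np.sin(0.15 * x + 0.1) + np.cos(0.12 * y) + np.sin(0.2 * t)
            + 0.5 * np.sin(0.1 * x + 0.07 * y + 0.13 * t))

t, y, x = np.meshgrid(np.arange(32.0), np.arange(48.0), np.arange(64.0), indexing="ij")
p_true = np.array([3.0, -2.0, 1.5])
S_ref = scene(x, y, t)
S_sec = scene(x - p_true[0], y - p_true[1], t - p_true[2])   # S'(x) = S(x - p)
p_est = align(S_ref, S_sec)
print(np.round(p_est, 2))
```

For a pure displacement, propagating the estimate to the next finer level amounts to doubling it, because one coarse voxel spans two fine voxels along every axis under the assumed subsampling. For the parametric models of Section 2.2 the rescaling of P between levels would depend on the parameterization.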

3 Properties of Sequence-to-Sequence Alignment

This section studies several inherent properties
of sequence-to-sequence alignment. In particular, it
is shown that sequence-to-sequence alignment is a

³ A Laplacian pyramid can equally be used.






(a) Sequence S1 (b) Sequence S2 (c) Frame from S1 (d) Frame from S2 (e) Possible image alignments

Figure 2: Spatial ambiguities in image-to-image alignment.
(a) and (b) display two sequences of a moving
ball. (c) and (d) show two corresponding frames from the two sequences.
There are infinitely many valid image-to-image
alignments between the two frames, some of them shown in (e), but only one of them aligns the two trajectories.


generalization of image-to-image alignment and of 
trajectory-to-trajectory alignment approaches. It is 
shown how ambiguities in spatial alignment can of- 
ten be resolved by adding temporal cues, and vice 
versa, how temporal ambiguities (reported in [6, 3]) 
can be resolved by adding spatial cues. These issues 
are discussed in Sections 3.1 and 3.2. We further show 
that temporal information is not restricted to moving 
objects. Different types of temporal events, such as 
changes in scene illumination, can contribute useful 
cues (Section 3.3). These properties are illustrated by 
examples from the algorithm presented in Section 2. 
However, the properties are general, and are not
limited to that particular algorithm.

3.1 Sequence-to-Sequence vs. Image-to-Image Alignment

This section shows that sequence-to-sequence alignment is a
generalization of image-to-image alignment. We first
show that when there are no temporal changes in
the scene, sequence-to-sequence alignment reduces to
image-to-image alignment, with an improved
signal-to-noise ratio. In particular, it is shown that the
algorithm presented in Section 2 reduces to the image
alignment algorithm of [1].

When there are no temporal changes in the scene,
all temporal derivatives within the sequence are zero:
$S_t = 0$. Therefore, for any voxel (x, y, t), the error
term of Eq. (4) reduces to:

$$e_{\text{seq-to-seq}}(x, y, t; P) = S' - S + [u\ \ v] \begin{bmatrix} S_x \\ S_y \end{bmatrix} = I' - I + [u\ \ v] \begin{bmatrix} I_x \\ I_y \end{bmatrix} = e_{\text{img-to-img}}(x, y; P),$$

where I(x, y) = S(x, y, t) is the image frame at time
t. Therefore, the SSD function of Eq. (3) reduces to:

$$ERR_{seq}(P) = \sum_{x,y,t} \left(e(x, y, t; P)\right)^2 = \sum_t \left( \sum_{x,y} \left(e(x, y, t; P)\right)^2 \right) = \sum_t ERR_{img}(P),$$




(c) Trajectory 1 (d) Trajectory 2 

Figure 3: Spatio-temporal ambiguity in
trajectory-to-trajectory alignment. This figure
shows a small airplane crossing a scene viewed by
two cameras. The airplane trajectory does not suffice to
uniquely determine the alignment parameters. Arbitrary
time shifts can be compensated by appropriate spatial
translation along the airplane motion direction.
Sequence-to-sequence alignment, on the other hand, can uniquely
resolve this ambiguity, as it uses both the scene dynamics
(the plane at different locations), and the scene appearance
(the static ground). Note that spatial information alone
does not suffice in this case either.



namely, the image-to-image alignment objective
function, summed (or, up to a constant factor, averaged) over all frames.

We next show that when the scene does contain 
temporal variations, sequence-to-sequence alignment uses more
information for spatial alignment than image-to-image 
alignment has access to. In particular, there are am- 
biguous scenarios for image-to-image alignment, which 
sequence-to-sequence alignment can uniquely resolve. 
Fig. 2 illustrates a case which is ambiguous for image- 
to-image alignment. Consider a uniform background 
scene with a moving ball (Fig. 2.a and Fig. 2.b). At 
any given frame (e.g., Fig. 2.c and Fig. 2.d) all the spa- 
tial gradients are concentrated in a very small image 
region (the moving ball). In these cases, image-to-image
alignment cannot uniquely determine the correct
spatial transformation (see Fig. 2.e).
Sequence-to-sequence alignment, on the other hand, does not
suffer from spatial ambiguities in this case, as the
spatial transformation must simultaneously bring into
alignment all corresponding frames across the two
sequences, i.e., the two trajectories must be in alignment.

Figure 4: Scene with moving objects. Rows (a) and (b) display five representative frames (0, 100, 200, 300, 400)
from the reference and second sequences, respectively. The spatial misalignment is easily observed near image boundaries,
where different static objects are visible in each sequence. The temporal misalignment is observed by comparing the position
of the gate in frame [?]00. In the second sequence it is already open, while still closed in the reference sequence. Row (c)
displays superposition of the representative frames before spatio-temporal alignment. The superposition composes the red
and blue bands from the reference sequence with the green band from the second sequence. Row (d) displays superposition of
corresponding frames after spatio-temporal alignment. The dark pink boundaries in (d) correspond to scene regions observed
only by the reference camera. The dark green boundaries in (d) correspond to scene regions observed only by the second
camera. See full sequences in CVPR'2000 website: Movie 1, ..., Movie 4.

3.2 Sequence-to-Sequence vs. Trajectory-to-Trajectory Alignment

While "trajectory-to-trajectory" alignment can 
also handle the alignment problem in Fig. 2, there are 
often cases where analysis of trajectories of temporal 
information alone does not suffice to uniquely deter- 
mine the spatio-temporal transformation between the 



two sequences. Such is the case in Fig. 3. When only 
the moving object information is considered (i.e., the 
trajectory of the airplane), then for any temporal shift, 
there exists a consistent spatial transformation be- 
tween the two sequences, which will bring the two tra- 
jectories in Figs. 3.c and 3.d into alignment. Namely, 
in this scenario, trajectory-to-trajectory alignment
will find infinitely many valid spatio-temporal trans- 
formations. Stein [6] noted this spatio-temporal ambi- 
guity, and reported its occurrence in car-traffic scenes, 
where all the cars move in the same direction with sim- 
ilar velocities. ([3] also reported a similar problem in 
their formulation). 
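
The ambiguity described above can be reproduced numerically. In the sketch below a centroid moves along a straight line at constant velocity, and every candidate time shift is matched by a spatial translation along the motion direction that aligns the two trajectories exactly. The start position, velocity, true shift and true offset are made-up values used only for illustration.

```python
import numpy as np

c0 = np.array([10.0, 20.0])     # start position of the centroid (assumed)
vel = np.array([2.0, 0.5])      # constant image velocity per frame (assumed)

def trajectory(t):
    """Centroid positions at times t for straight-line, constant-speed motion."""
    return c0 + np.outer(t, vel)

t = np.arange(0.0, 50.0)
true_shift = 7.0                                      # temporal offset in frames
true_offset = np.array([4.0, -3.0])                   # spatial translation
traj_sec = trajectory(t + true_shift) + true_offset   # trajectory in second camera

def trajectory_error(dt, translation):
    """Mean distance to traj_sec after a candidate time shift and translation."""
    candidate = trajectory(t + dt) + translation
    return float(np.mean(np.linalg.norm(candidate - traj_sec, axis=1)))

def compensating_translation(dt):
    """Translation along the motion direction that absorbs a wrong time shift."""
    return true_offset + (true_shift - dt) * vel

errors = [trajectory_error(dt, compensating_translation(dt))
          for dt in (0.0, 3.5, 7.0, 12.0)]
print(errors)   # every candidate time shift yields a perfectly aligned trajectory
```

A static background feature, by contrast, is consistent with only one of these translations, which is the extra constraint that sequence-to-sequence alignment exploits.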

While trajectory-to-trajectory alignment will find 
infinitely many valid spatio-temporal transformations 
for the scenario in Fig. 3, only one of those
spatio-temporal transformations will also be consistent with
the static background (i.e., the tree and the hori- 
zon). Sequence-to-sequence alignment will therefore 
uniquely resolve the ambiguity in this case, as it 
forces both spatial and temporal information to be 
brought simultaneously into alignment across the two 
sequences. 

The direct method for sequence-to-sequence align- 
ment presented in Section 2 is only one possible 
algorithm for solving this problem. The concept 
of sequence-to-sequence alignment, however, is more 
general, and is not limited to that particular algo- 
rithm. One could, for example, extend the feature- 
based trajectory-to-trajectory alignment algorithm of 
[6] into a feature-based sequence-to-sequence align- 
ment algorithm, by adding static feature correspon- 
dences to the dynamic features. 

While feature-based methods can theoretically
account for larger spatio-temporal misalignments, it is
important to note that the direct method suggested in
Section 2 obtains spatio-temporal alignment between
the two sequences without the need to explicitly
separate and distinguish between the two types of
information: the spatial and the temporal. Moreover, it
does not require any explicit detection and tracking
of moving objects, nor does it need to detect features
and explicitly establish their correspondences across
sequences. Finally, because temporal variations need
not be explicitly modeled in the direct method, it can
exploit other temporal variations in the scene, such as
changes in illumination. Such temporal variations are
not captured by trajectories of moving objects.

3.3 Illumination Changes as a Cue for Alignment

Temporal derivatives are not necessarily a result of 
independent object motion, but can also result from 
other changes in the scene which occur over time, such 
as changes in illumination. Dimming or brightening of
the light source is often sufficient to determine the
temporal alignment. Furthermore, even homogeneous 
image regions contribute temporal constraints in this 
case. This is true although their spatial derivatives 
are zero, since global changes in illumination produce 
prominent temporal derivatives. 

For example, in the case of the algorithm presented
in Section 2, for a voxel in a uniform region
($S_x = S_y = 0$) undergoing illumination variation
($S_t \neq 0$), Eq. (4) provides the following constraint
on the temporal alignment parameters:
$e(x, y, t; P) = (S'(x, y, t) - S(x, y, t)) + w(x, y, t; P)\, S_t(x, y, t)$. Note
that, in general, changes in illumination need not be



global. For example, an outdoor scene on a partly 
cloudy day, or an indoor scene with spot-lights, can be 
exposed to local changes in illumination. Such local 
changes provide additional constraints on the spatial 
alignment parameters. An example of applying our 
algorithm to sequences with only changes in illumina- 
tion is shown in Fig. 5. 
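
As a hedged illustration of how illumination alone constrains the temporal parameters, the sketch below estimates a constant temporal displacement w from the brightness of a uniform region over time, by minimizing the sum of squared errors e = (S' - S) + w * S_t. The lamp brightness profile, the sub-frame shift of 0.4 frames and the function names are invented for this example and do not correspond to the experiment of Fig. 5.

```python
import numpy as np

def temporal_shift_from_brightness(b_ref, b_sec):
    """Least-squares w from e = (S' - S) + w * S_t over a uniform region.

    b_ref and b_sec hold the region brightness per frame in each sequence.
    """
    b_ref = np.asarray(b_ref, dtype=float)
    b_sec = np.asarray(b_sec, dtype=float)
    b_t = np.gradient(b_ref)                      # temporal derivative S_t
    return float(-np.sum((b_sec - b_ref) * b_t) / np.sum(b_t * b_t))

def lamp_brightness(t):
    """Slowly brightening and dimming light source (assumed profile)."""
    return 100.0 + 40.0 * np.sin(2.0 * np.pi * t / 80.0)

t = np.arange(100, dtype=float)
w_true = 0.4                                      # sub-frame temporal shift
b_ref = lamp_brightness(t)
b_sec = lamp_brightness(t - w_true)               # S'(t) = S(t - w), as in Eq. (1)
w_est = temporal_shift_from_brightness(b_ref, b_sec)
print(round(w_est, 3))
```

No spatial gradient is used anywhere in this estimate, which mirrors the observation that homogeneous regions still contribute temporal constraints.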



[Figure 5 panels: frame 200, frame 250, frame 300]

Figure 5: Scene with varying illumination.
Rows (a) and (b) display three representative frames
(200, 250, 300) from the reference and second sequences,
respectively. The temporal misalignment can be observed
at frame 250, by small differences in illumination. (c)
displays superposition of the representative frames before
alignment (red and blue bands from reference sequence and
green band from the second sequence). (d) displays
superposition of corresponding frames after spatio-temporal
alignment. The accuracy of the temporal alignment is
evident from the hue in the upper left corner of frame 250,
which is pink before alignment (frame 250.c) and white
after temporal alignment (frame 250.d). The dark pink
boundaries in (d) correspond to scene regions observed only
by the reference camera. See full sequences in CVPR
website: Movie 5, ..., Movie 8.

4 Experiments 

In our experiments, two different interlaced CCD 
cameras (mounted on tripods) were used for sequence 
acquisition. Typical sequence length is several
hundred frames. Fig. 4 shows a scene with a car
driving in a parking lot. When the car reaches the exit,
the gate is raised. The two input sequences (Figs. 4.a
and 4.b) were taken from a distance (from two
different windows of a tall building). Fig. 4.c displays
superposition of representative frames, generated by
mixing the red and blue bands from the reference
sequence with the green band from the second sequence.
This demonstrates the initial misalignment between 
the two sequences, both in time (the sequences were 
out of synchronization, note the different timing of 
the gate being lifted in the two sequences), as well as 
in space (note the misalignment in static scene parts, 
such as in the other parked cars or at the bushes). 
Fig. 4.d shows the superposition of frames after apply- 
ing spatio-temporal alignment. The second sequence 
was spatio-temporally warped towards the reference 
sequence according to the computed parameters. The 
recovered temporal shift was 46.5 frames, and was
verified against the ground truth, obtained by auxiliary
equipment. The recovered spatial affine transformation
indicated a translation on the order of 1/5 of
the image size, a small rotation, a small scaling,
and a small skew (due to different aspect ratios of the
two cameras). Note the good quality of alignment
despite the overall difference in brightness between the
two input sequences.
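
The superposition used for visual inspection (red and blue bands from the reference sequence, green band from the second sequence) can be sketched as below. The function name and the assumption that frames are stored as `(height, width, 3)` arrays in R, G, B channel order are ours.

```python
import numpy as np

def superpose(frame_ref, frame_sec):
    """Red and blue bands from the reference frame, green from the second."""
    frame_ref = np.asarray(frame_ref)
    frame_sec = np.asarray(frame_sec)
    out = frame_ref.copy()                 # keeps R and B of the reference
    out[..., 1] = frame_sec[..., 1]        # replaces G with the second frame
    return out

# Two gray test frames: where they agree the result stays gray, and where
# the reference is brighter the result turns pink (strong red and blue).
ref = np.full((2, 2, 3), 200, dtype=np.uint8)
sec = np.full((2, 2, 3), 100, dtype=np.uint8)
aligned = superpose(ref, ref)
misaligned = superpose(ref, sec)
print(aligned[0, 0], misaligned[0, 0])
```

This is why well-aligned regions appear in neutral gray or white tones in the figures, while residual misalignment shows up as pink or green fringes.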

Fig. 5 illustrates that temporal alignment is not 
limited to motion information alone. A light source 
was brightened and then dimmed down, resulting in 
observable illumination variations in the scene. The 
cameras were imaging a picture on a wall from
significantly different viewing angles, inducing a
significant perspective distortion. Figs. 5.a and 5.b show a
few representative frames from two sequences of sev- 
eral hundred frames each. The effects of illumination 
are particularly evident in the upper left corner of the 
image. Fig. 5.c shows a superposition of the repre- 
sentative frames from both sequences before spatio- 
temporal alignment. Fig. 5.d shows superposition of 
corresponding frames after spatio-temporal alignment. 
The recovered temporal offset (21.3 frames) was
verified against the ground truth. The accuracy of the
temporal alignment is evident from the hue in the
upper left corner of frame 250, which is pink before
alignment (frame 250.c) and white after temporal
alignment (frame 250.d).

5 Conclusion 

In this paper we have introduced a new approach 
to sequence-to-sequence alignment, which simultane- 
ously uses all available spatial and temporal infor- 
mation within the video sequences. We showed that 



our approach combines the benefits of image-to-image 
alignment with the benefits of trajectory-to-trajectory
alignment, and is a generalization of both approaches. 
Furthermore, it resolves many of the inherent ambigu- 
ities associated with each of these two classes of meth- 
ods. While the approach is general, we have also pre- 
sented a specific algorithm for sequence-to-sequence 
alignment, which recovers the spatio-temporal align- 
ment parameters directly from spatial and temporal 
brightness variations within the sequence. 

The current discussion and implementation were re- 
stricted to stationary cameras, and hence used only 
two types of information cues for alignment - the scene 
dynamics and the scene appearance. We are currently 
extending our approach to handle moving cameras. 
This adds a third type of information cue for align- 
ment, which is inherent to the scene and is common 
to the two sequences - the scene geometry. 

Appendix A: Spatio-Temporal Aliasing

This appendix discusses the tradeoff between temporal
aliasing and spatial resolution. The intensity
values at a given pixel $(x_0, y_0)$ along time induce a
1-D temporal signal: $S_{(x_0, y_0)}(t) = S(x_0, y_0, t)$. Due
to the object motion, a fixed pixel samples a moving
object at different locations, denoted by the "trace
of pixel $(x_0, y_0)$". Thus temporal variations at pixel
$(x_0, y_0)$ are equal to the gray level variations along
the trace (see Fig. 6). Denote by $\Delta_{trace}$ the spatial
step size along the trace. For an object moving
at velocity v: $\Delta_{trace} = v \Delta t$, where $\Delta t$ is the time

CLAIMS



1. A method for establishing correspondences in time and in space between two 
different video sequences of the same dynamic scene, recorded by stationary
uncalibrated video cameras, wherein said method simultaneously estimates both 
spatial alignment as well as temporal synchronization (temporal alignment) between 
the two sequences, using all available spatio-temporal information. 

2. A direct method for sequence-to-sequence alignment wherein the algorithm 
simultaneously estimates spatial and temporal alignment parameters directly from 
measurable sequence quantities. 

3. A method according to claim 1 or 2, substantially as described herein in the 
specification and as illustrated in the figures. 

4. Any and all products, methods and other features of the invention as described 
herein in the specification and as illustrated in the figures. 